Take Home Exercise 2 - Regionalization of Nigeria with Water points

Author

Allan Chong

Overview

Water is a crucial resource for humanity. People must have access to clean water in order to be healthy. It promotes a healthy environment, peace and security, and a sustainable economy. However, more than 40% of the world’s population lacks access to enough clean water. According to UN-Water, 1.8 billion people would live in places with a complete water shortage by 2025. One of the many areas that the water problem gravely threatens is food security. Agriculture uses over 70% of the freshwater that is present on Earth.

The severe water shortages and water quality issues are seen in underdeveloped countries. Up to 80% of infections in developing nations are attributed to inadequate water and sanitation infrastructure.

Despite technological advancement, providing rural people with clean water continues to be a key development concern in many countries around the world, especially in those on the continent of Africa.

The spatial patterns of non-functional water points will be shown in this study by using the proper global and local spatial association methodologies. We look at Nigeria’s in this assignment.

Data points of interest

In this assignment, we will attempt to regionalize Nigeria based on the following variables:

  • Total number of functional water points

  • Total number of nonfunctional water points

  • Percentage of functional water points

  • Percentage of non-functional water points

  • Percentage of main water point technology (i.e. Hand Pump)

  • Percentage of usage capacity (i.e. < 1000, >=1000)

  • Percentage of rural water points

  • Percentage of potable vs non potable water points

  • Percentage of water points accessible within median of primary road network

  • Percentage of water points accessible within median of secondary road network

  • Percentage of water points accessible within median of tertiary road network

  • Percentage of water points that are managed

Getting Started

First, we load the required packages in R

  • Spatial data handling

    • sf, rgdal and spdep
  • Attribute data handling

    • knitr, tidyverse, funModeling especially readr, ggplot2 and dplyr
  • Choropleth mapping

    • tmap
  • Multivariate data visualization and analysis

    • coorplot, ggpubr, and heatmaply
  • Cluster analysis

    • cluster

    • ClustGeo

pacman::p_load(knitr, rgdal, spdep, tmap, sf, 
               ggpubr, cluster, funModeling,
               factoextra, NbClust, #factoextra factor analysis, access clustering results
               heatmaply, corrplot, psych, tidyverse)

Spatial Data

The spatial dataset used in this assignment is the Nigeria Level-2 Administrative Boundary spatial dataset downloaded from Geoboundaries

We will load the spatial features by using st_read() from the sf package

As the data we want is in WSG-84 format, we set crs to 4326.

We won’t utilize st_transform() at this time because it can result in outputs with missing points after transformation, which would skew our study.

nga = st_read(dsn = "data/geospatial",
               layer = "geoBoundaries-NGA-ADM2",
               crs = 4326)
Reading layer `geoBoundaries-NGA-ADM2' from data source 
  `D:\Allanckw\ISSS624\Take-Home_Ex2\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84

We could use st_crs()to verify the coordinate system from the object.

st_crs(nga)
Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

Updating Spatial features that have identical name but are in different states

The following code determines whether any LGA names have been repeated. If the shapeName is not distinct for rows, duplicated() returns True. Only rows that satisfy the duplicated rows = True criterion are returned by subset() (Ong, 2022).

duplicate = nga$shapeName[nga$shapeName %in% nga$shapeName[duplicated(nga$shapeName)]]

duplicate
 [1] "Bassa"    "Bassa"    "Ifelodun" "Ifelodun" "Irepodun" "Irepodun"
 [7] "Nasarawa" "Nasarawa" "Obi"      "Obi"      "Surulere" "Surulere"

From the result, there are 12 LGAs that have the same names, even though they are from different states. We will want to rename them or it would be confusing when we conduct further analysis.

Firstly, we create a data frame with the duplicate

nga_dupes = nga %>%
  filter(shapeName %in% duplicate)

Calling ttm() in the tmap package will switch the tmap’s viewing mode to interactive viewing, which will help us find the states of the respective duplicates in the map. We then plot the map with tmap functions

ttm()
tm_shape(nga_dupes) + 
  tm_polygons("shapeName") +
  tm_borders(alpha=0.5) 

From the result, we can observe the following:

shapeID shapeName
NGA-ADM2-72505758B95534398 Bassa (Kogi)
NGA-ADM2-72505758B52690633 Bassa (Plateau)
NGA-ADM2-72505758B26581542 Ifelodun (Kwara)
NGA-ADM2-72505758B18326272 Ifelodun (Osun)
NGA-ADM2-72505758B75034141 Irepodun (Kwara)
NGA-ADM2-72505758B79178637 Irepodun (Osun)
NGA-ADM2-72505758B6786568 Nasarawa (Kano)
NGA-ADM2-72505758B67188591 Nasarawa (Nasarawa)
NGA-ADM2-72505758B7318634 Obi (Benue)
NGA-ADM2-72505758B3073896 Obi (Nasarawa)
NGA-ADM2-72505758B6675111 Surulere (lagos)
NGA-ADM2-72505758B31597260 Surulere (Oyo)

Based on the results above, we replace the shape names with shapeID as the identifier and run the duplicate check again to verify that it is now empty

nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B95534398'] = 'Bassa (Kogi)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B52690633'] = 'Bassa (Plateau)'

nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B26581542'] = 'Ifelodun (Kwara)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B18326272'] = 'Ifelodun (Osun)'

nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B75034141'] = 'Irepodun (Kwara)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B79178637'] = 'Irepodun (Osun)'

nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B6786568'] = 'Nasarawa (Kano)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B67188591'] = 'Nasarawa (Nasarawa)'

nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B7318634'] = 'Obi (Benue)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B3073896'] = 'Obi (Nasarawa)'

nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B6675111'] = 'Surulere (lagos)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B31597260'] = 'Surulere (Oyo)'

duplicate = nga$shapeName[nga$shapeName %in% nga$shapeName[duplicated(nga$shapeName)]]

duplicate
character(0)

Aspatial Data

Cleaning the Data

The aspatial dataset used in this assignment is the water point data exchange dataset found in WPdx Global Data Repositories. Data is filtered on the web portal to only keep Nigeria and the file is saved as NigeriaWaterPoints_Raw.csv

As we are only interested in the functionality of the water point, it is important to capture fields that may may aid us in our analysis (Definition are found here: Source)

  • LGA: The area we are interested in

  • State: The state of the LGA of Nigeria

  • Functional: Whether it is functional or not

  • Management: who manages it?

  • Quality: what is the quality?

  • Water Source Category: where the water came from?

  • Water Tech Category: What technology is used?

  • is_urban: Is it in an urban area?

  • #distance_to_primary_road: Based on calculations with data from OpenStreetMap, the distance in km to the nearest road.

  • #distance_to_secondary_road: Based on calculations using OpenStreetMap data, the distance in km to the second-closest road.

  • #distance_to_tertiary_road: Using calculations with data from OpenStreetMap, the distance to the third-closest road is given in km.

  • Usage Capacity: Maximum recommended number of users per waterpoint

  • New Georeferenced Column: : Well Known Text (wkt) representing spatial data in a textual format

To load the raw data file, we use the read_csv function

wpdx_raw = read_csv("data/aspatial/NigeriaWaterPoints_Raw.csv") 

Most of the columns are irrelevant, so we will perform the following:

  • keep the columns we want to clean it up by specifying the columns with one to retain with subset

  • renaming the columns using rename_with

  • Replace all the NA with unknown for columns with NA value present

retain_cols = c('#clean_adm2', '#clean_adm1', '#status_clean', '#management_clean'
                , '#subjective_quality', '#water_source_category', 
                '#water_tech_category', 
                'New Georeferenced Column', 'is_urban','#distance_to_primary_road',
                '#distance_to_secondary_road', '#distance_to_tertiary_road'
                ,'usage_capacity')

new_col_names = c('LGA', 'State', 'Functional', 'Management', 
                  'Quality',  'Water_Source_Category', 'Water_Tech_Category',
                  'WKT', 'Is_Urban_Area', 'dist_to_primary_road'
                  ,'dist_to_secondary_road', 'dist_to_tertiary_road'
                  ,'usage_capacity')

wpdx_clean = subset(wpdx_raw, select = (names(wpdx_raw) %in% retain_cols)) %>%  rename_with(~ new_col_names, all_of(retain_cols)) %>% 
replace_na(list(Functional = "Unknown", Management = "Unknown", Quality = "Unknown", Water_Source_Category = "Unknown", Water_Tech_Category = "Unknown"))

We save the processed data frame wpdx_clean into a file, the file will be reduced to 3.4 MB from the 144 MB raw file that we downloaded.

saveRDS(wpdx_clean, "data/aspatial/wpdx_clean_ex2.rds")

We can then delete the raw file from the project and retrieve the saved RDS file using readRDS()

wpdx_clean = readRDS("data/aspatial/wpdx_clean_ex2.rds")

Converting csv data into spatial features

We can use st_as_sfc() to come up with the new field Geometry by using the WKT field

wpdx_clean$Geometry = st_as_sfc(wpdx_clean$`WKT`)

We will then use st_sf() to convert the tibble data frame into sf data frame. The EPSG 4326 code is used as the dataset is referencing WGS84 geographic coordinate system.

We could use st_crs()to verify the coordinate system from the object.

wpdx_clean_sf = st_sf(wpdx_clean, crs=4326) 
st_crs(wpdx_clean_sf)
Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

We can then use glimpse() to verify each field’s data type & available values.

glimpse(wpdx_clean_sf)
Rows: 95,008
Columns: 14
$ Water_Source_Category  <chr> "Unknown", "Well", "Well", "Well", "Well", "Wel…
$ Water_Tech_Category    <chr> "Tapstand", "Mechanized Pump", "Hand Pump", "Un…
$ State                  <chr> "Ekiti", "Ogun", "Ebonyi", "Enugu", "Enugu", "B…
$ LGA                    <chr> "Moba", "Obafemi-Owode", "Ohaukwu", "Isi-Uzo", …
$ Management             <chr> "Unknown", "Other", "Unknown", "Unknown", "Unkn…
$ Functional             <chr> "Unknown", "Functional", "Unknown", "Unknown", …
$ Quality                <chr> "Unknown", "Acceptable quality", "Unknown", "Un…
$ dist_to_primary_road   <dbl> 767.3742, 13364.9005, 9492.7619, 9319.0815, 543…
$ dist_to_secondary_road <dbl> 921.79187, 48.87743, 4333.34280, 23276.33227, 1…
$ dist_to_tertiary_road  <dbl> 3146.733237, 4167.519068, 693.211204, 307.71619…
$ usage_capacity         <dbl> 250, 1000, 300, 300, 300, 300, 300, 1000, 300, …
$ Is_Urban_Area          <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ WKT                    <chr> "POINT (5.12 7.98)", "POINT (3.5976683 6.964531…
$ Geometry               <POINT [°]> POINT (5.12 7.98), POINT (3.597668 6.9645…

Lastly, the attribute data from the nga sf data frame will be transferred into the wpdx_clean_sf data frame using a geoprocessing function known as point-in-polygon overlay. st_join() from the sf package can help us with that

nga_wp = st_join(wpdx_clean_sf, nga)

Exploratory Data Analysis (EDA)

We can use freq() of the funModeling package to display the distribution of our data points of interest using wpdx_clean_sf. This is to help us aggregate the data as the dataset provide breakdowns of their respective catagories.

Categorical Variables

freq(data=nga_wp, input = 'Functional')

                        Functional frequency percentage cumulative_perc
1                       Functional     45883      48.29           48.29
2                   Non-Functional     29385      30.93           79.22
3                          Unknown     10656      11.22           90.44
4      Functional but needs repair      4579       4.82           95.26
5 Non-Functional due to dry season      2403       2.53           97.79
6        Functional but not in use      1686       1.77           99.56
7         Abandoned/Decommissioned       234       0.25           99.81
8                        Abandoned       175       0.18           99.99
9 Non functional due to dry season         7       0.01          100.00

Checking out functionality, we know that functional water points are broken down into Functional, Functional but needs repair, and Functional but not in use. 48.29% are functional, 4.82% are functional but needs repair and 1.77% of them are functional but not in use.

freq(data=nga_wp, input = 'Management')

                             Management frequency percentage cumulative_perc
1                               Unknown     35917      37.80           37.80
2                  Community Management     31655      33.32           71.12
3           Direct Government Operation     16871      17.76           88.88
4                                 Other      8361       8.80           97.68
5                     School Management      1435       1.51           99.19
6                  Health Care Facility       440       0.46           99.65
7 Private Operator/Delegated Management       194       0.20           99.85
8        Other Institutional Management       135       0.14          100.00

Checking out water point management, we know that water points are mostly managed except 37.8% which are unknowns

freq(data=nga_wp, input = 'Quality')

                              Quality frequency percentage cumulative_perc
1                  Acceptable quality     71801      75.57           75.57
2                             Unknown     10625      11.18           86.75
3                 No because of Taste      6712       7.06           93.81
4                No because of Colour      3986       4.20           98.01
5                 No because of Odour      1847       1.94           99.95
6    Within National limits (potable)        19       0.02           99.97
7 Within National standards (potable)        18       0.02          100.00

Checking out the quality of water points, we know that 75.57% of them are of acceptable quality, with an additional 0.04% of them within potable national limits / standards. The rest are either unknown or of unacceptable quality.

freq(data=nga_wp, input = 'Water_Source_Category')

  Water_Source_Category frequency percentage cumulative_perc
1                  Well     91540      96.35           96.35
2                Spring      3163       3.33           99.68
3               Unknown       302       0.32          100.00
4           Piped Water         3       0.00          100.00

Checking out the source of water points, we know that majority of them (96.35%) comes from well. With such a high number, this variable may not be useful for our analysis as it most of the data points will come have this value.

freq(data=nga_wp, input = 'Water_Tech_Category')

  Water_Tech_Category frequency percentage cumulative_perc
1           Hand Pump     58755      61.84           61.84
2     Mechanized Pump     25644      26.99           88.83
3             Unknown     10055      10.58           99.41
4            Tapstand       553       0.58           99.99
5     Rope and Bucket         1       0.00          100.00

Checking out the technology the water point uses, we know that water points are broken down into Hand pumps, Mechanized Pump, Tapstand, Rope and bucket. 61.84% of water points operates on hand pumps, 26.99% on mechanized pumps, 10.58% are unknowns and a minority of them (less than 0.58%) are either on tapstand or Rope and Bucket.

freq(data=nga_wp, input = 'Is_Urban_Area')

  Is_Urban_Area frequency percentage cumulative_perc
1         FALSE     75444      79.41           79.41
2          TRUE     19564      20.59          100.00

From the records, only 20.59% of the water points are in urban areas, while the rest of them (79.41%) are in the rural areas.

freq(data=nga_wp, input = 'usage_capacity')

  usage_capacity frequency percentage cumulative_perc
1            300     68789      72.40           72.40
2           1000     25644      26.99           99.39
3            250       573       0.60           99.99
4             50         2       0.00          100.00

Majority of the water points caters to 300 people or less (73.1%), while 26.99% of the water points caters to a capacity of 1000 people

Continuous Variables

We will need to find out the summary of the respective distance in order to categorize them appropriately for analysis. We can achieve that by using summary statistics in R

summary(nga_wp$dist_to_primary_road)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.01  1361.68  6647.50 10177.43 15132.29 82666.56 
summary(nga_wp$dist_to_secondary_road)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   788.6  4446.0  7934.7 11083.7 94773.0 
summary(nga_wp$dist_to_tertiary_road)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   261.5  1442.4  3853.2  5114.6 58874.0 

Based on the results above, we know that the median distance (in km) to primary, secondary and tertiary roads is 6647.50 km, 4446.0 km and 1442.4 km respectively. We can use within median distance to analyze if accessibility to water points is a factor to regionalization

Aggregate the Data

To aggregate the variable of interest, we will create new data frames to store them by using the filter function. Variables names used are self explanatory.

func_list = c("Functional", "Functional but needs repair", "Functional but not in use")
wpt_functional_true = wpdx_clean_sf %>%
  filter(Functional %in% func_list)

wpt_functional_false = wpdx_clean_sf %>%
  filter(!Functional %in% c(func_list, "Unknown"))

wpt_rural = wpdx_clean_sf %>%
  filter(Is_Urban_Area == FALSE)

wpt_handpumps_true = wpdx_clean_sf %>%
  filter(Water_Tech_Category %in% "Hand Pump")

wpt_handpumps_false = wpdx_clean_sf %>%
  filter(!Water_Tech_Category %in% "Hand Pump")

wpt_potable_true = wpdx_clean_sf %>%
  filter(Quality %in% c("Acceptable quality", "Within National standards (potable)
", "Within National limits (potable)"))

wpt_potable_false = wpdx_clean_sf %>%
  filter(!Quality %in% c("Acceptable quality", "Within National standards (potable)
", "Within National limits (potable)", "Unknown"))

wpt_potable_unknown = wpdx_clean_sf %>%
  filter(!Quality %in% "Unknown")

wpt_usageOver1000 = wpdx_clean_sf %>%
  filter(usage_capacity >= 1000)

wpt_usageUnder1000 = wpdx_clean_sf %>%
  filter(usage_capacity < 1000)

wpt_accessibility_PriRd_lessThanMedian = wpdx_clean_sf %>%
  filter(dist_to_primary_road < 6647.50)
  
wpt_accessibility_PriRd_moreThanMedian = wpdx_clean_sf %>%
  filter(dist_to_primary_road >= 6647.50)

wpt_accessibility_secRd_lessThanMedian = wpdx_clean_sf %>%
  filter(dist_to_secondary_road < 4446.0  )
  
wpt_accessibility_secRd_moreThanMedian = wpdx_clean_sf %>%
  filter(dist_to_secondary_road >= 4446.0  )
  
wpt_accessibility_TerRd_lessThanMedian = wpdx_clean_sf %>%
  filter(dist_to_tertiary_road < 1442.4  )
  
wpt_accessibility_TerRd_moreThanMedian = wpdx_clean_sf %>%
  filter(dist_to_tertiary_road >= 1442.4  )

wpt_managed_true = wpdx_clean_sf %>%
  filter(!Management %in% "Unknown")

wpt_managed_unknown = wpdx_clean_sf %>%
  filter(Management %in% "Unknown")

Computing our data points of interest

We can use st_intersects() to find common data points between geographical datasets. In our case we need to find the common points in the Nigeria’s LGA spatial dataset and the water point aspatial dataset

The below code adds new columns to compute the total number of water points of our data point of interests

As an example, the below code intersects the Nigeria LGA dataset (nga_wp dataframe) with the water point dataset (wpdx_clean_sf dataframe) and produce a new column to denote the total number of water points in the area (Total wpt) by using mutate() and lengths()

nga_wp = nga %>% 
  #combine nga with water point sf
  mutate(`total wpt` = lengths(
    st_intersects(nga, wpdx_clean_sf)))

Similarly, for the rest of the variables, we piped the output and add the new columns to denote the number of

  • water points are functional, non functional and unknown

  • Water points that are in rural areas

  • Water points that either uses Hand pumps or otherwise

  • Water points that have either potable or non potable water

  • Water points with usage capacity of over 1000 persons or under

  • Water points within or over Median distance of Primary, Secondary and Tertiary roads

  • Water points management that are managed

nga_wp = nga %>% 
  #combine nga with water point sf
  mutate(`total wpt` = lengths(
    st_intersects(nga, wpdx_clean_sf))) %>%
  #add columns to produce no. of functional, non functional and unknown points
  mutate(`wpt functional` = lengths(
    st_intersects(nga, wpt_functional_true))) %>%
  mutate(`wpt non functional` = lengths(
    st_intersects(nga, wpt_functional_false))) %>%
  # rural
  mutate(`isRural` = lengths(
    st_intersects(nga, wpt_rural))) %>%
  # hand pumps
  mutate(`Uses Handpumps` = lengths(
    st_intersects(nga, wpt_handpumps_true))) %>%
  # Potable
   mutate(`Potable` = lengths(
    st_intersects(nga, wpt_potable_true))) %>%
  mutate(`Non Potable` = lengths(
    st_intersects(nga, wpt_potable_false))) %>%
  # Usage Capacity
  mutate(`usage Over 1000` = lengths(
    st_intersects(nga, wpt_usageOver1000))) %>%
  mutate(`usage Under 1000` = lengths(
    st_intersects(nga, wpt_usageUnder1000))) %>%
  #Primary Road
  mutate(`Within Median Distance to Pri Road` = lengths(
    st_intersects(nga, wpt_accessibility_PriRd_lessThanMedian))) %>%
  mutate(`Exceed Median Distance to Pri Road` = lengths(
    st_intersects(nga, wpt_accessibility_PriRd_moreThanMedian))) %>%
  #Secondary Road
  mutate(`Within Median Distance to Sec Road` = lengths(
    st_intersects(nga, wpt_accessibility_secRd_lessThanMedian))) %>%
  mutate(`Exceed Median Distance to Sec Road` = lengths(
    st_intersects(nga, wpt_accessibility_secRd_moreThanMedian))) %>%
  #Tertiary Road
  mutate(`Within Median Distance to Ter Road` = lengths(
    st_intersects(nga, wpt_accessibility_PriRd_lessThanMedian))) %>%
  mutate(`Exceed Median Distance to Ter Road` = lengths(
    st_intersects(nga, wpt_accessibility_TerRd_moreThanMedian))) %>%
  #management
  mutate(`Managed` = lengths(
    st_intersects(nga, wpt_managed_true))) 

Once we are done with the raw data, we use the below code to compute the respective percentages

  • percentage of functional water points

  • percentage of non functional water points

  • percentage of water points with hand pumps as technology

  • percentage of water points with potable / non potable water

  • percentage of water points with capacity over/under 1000

  • percentage of water points that are managed

  • percentage of water points in rural areas

  • percentage of water points are within the median distance to primary road network

  • percentage of water points are within the median distance to secondary road network

  • percentage of water points are within the median distance to tertiary road network

nga_wp = nga_wp %>%
  #add columns to compute %
  mutate(`pct_functional` = `wpt functional`/`total wpt`) %>%
  mutate(`pct_non-functional` = `wpt non functional`/`total wpt`) %>%

  mutate(`pct_Handpump` = `Uses Handpumps`/`total wpt`) %>%

  mutate(`pct_Potable` = `Potable`/`total wpt`) %>%
  mutate(`pct_NonPotable` = `Non Potable`/`total wpt`) %>%
  
  mutate(`pct_Cap_Over_1000` = `usage Over 1000`/`total wpt`) %>%
  mutate(`pct_Cap_Under_1000` = `usage Under 1000`/`total wpt`) %>%
    
  mutate(`pct_managed` = `Managed`/`total wpt`) %>%
  mutate(`pct_rural` = `isRural`/`total wpt`) %>%
  
  mutate(`pct_w_meddist_to_PriRoad` = `Within Median Distance to Pri Road`/`total wpt`) %>%
  
  mutate(`pct_w_meddist_to_SecRoad` = `Within Median Distance to Sec Road`/`total wpt`) %>%
  
  mutate(`pct_w_meddist_to_TerRoad` = `Within Median Distance to Ter Road`/`total wpt`) 

After we are done, we will inspect the data by viewing the data frame, we can observe that there are some rows that have NaN due to division by zero. Lets replace those values to 0 with the is.na() function and change the row names to the names of the LGAs with row.names() by using the code below

nga_wp[is.na(nga_wp)] = 0
row.names(nga_wp) = nga_wp$shapeName

We will only keep the following variables as the rest are intermediate data points that are no longer relevant

  1. wpt functional

  2. wpt non functional

  3. pct_functional

  4. pct_non-functional

  5. pct_Handpump

  6. pct_Potable

  7. pct_NonPotable

  8. pct_Cap_Over_1000

  9. pct_Cap_Under_1000

  10. pct_managed

  11. pct_rural

  12. pct_w_meddist_to_PriRoad

  13. pct_w_meddist_to_SecRoad

  14. pct_w_meddist_to_TerRoad

nga_wp_interested_data_pts = nga_wp  %>% 
      select(8:9, 23:34) 

We save the processed data frame nga_wp_interested_data_pts into a file with saveRDS(), so that we do not need to process it again

saveRDS(nga_wp_interested_data_pts, "data/geospatial/nga_wp_interested_data_pts.rds")

We can then retrieve the saved RDS file using readRDS()

nga_wp_interested_data_pts = readRDS("data/geospatial/nga_wp_interested_data_pts.rds")

EDA using statistical graphics

A Histogram is useful to identify the overall distribution of the data values (i.e. positively skew, negatively skew or normal distribution)

We use ggplot2’s histogram (geom_histogram) to plot the percentage functional and non functional water points with the mean and median as abline using geom_vline()

We plot them side by side using ggarrange

functional_pct_histo = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Functional", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$pct_functional)),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$pct_functional)),   
               color="brown", linetype="dashed", linewidth=1)

nonfunctional_pct_histo = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_non-functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Non Functional", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_non-functional` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_non-functional`)),   
               color="brown", linetype="dashed", linewidth=1)


ggarrange(functional_pct_histo, nonfunctional_pct_histo, ncol = 2, nrow = 1)

From the result, we can observe that both functional and non functional water points distributions are positively skewed. The functional water point mean and median is close to 50%, whereas the non functional water points median and mean are very close to each other at 35.05% and 35.92% respectively

We can also use box plot to detect outliers with geom_boxplot. We plot them side by side using ggarrange

functional_boxplot = ggplot(data=nga_wp_interested_data_pts, 
       aes(x=`pct_functional`)) +
       labs(x = "% Functional") +
       geom_boxplot(color="black", fill="light blue")

nonfunctional_boxplot = ggplot(data=nga_wp_interested_data_pts, 
       aes(x=`pct_non-functional`)) + 
       labs(x = "% Non Functional") +
       geom_boxplot(color="black", fill="light blue") 
  

ggarrange(functional_boxplot, nonfunctional_boxplot, ncol = 2, nrow = 1)

We can see that there isn’t any outliers in the functional water points, but for the non functional water points, there is an outlier where 100% of the water points are not functional.

In the figure below, multiple histograms are plotted to reveal the distribution of the selected variables in the ict_derived data.frame. First, We do this by creating all the histograms assigned to individual variables.

functional = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `wpt functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "No. of Functional", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`wpt functional`)), 
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`wpt functional`)), color="brown", linetype="dashed", linewidth=1)

nonfunctional = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `wpt non functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "No. of Non Functional", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`wpt non functional` )),color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`wpt non functional`)),color="brown", linetype="dashed", linewidth=1)

functional_pct = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Functional", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_functional` )),color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_functional`)), color="brown", linetype="dashed", linewidth=1)

nonfunctional_pct = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_non-functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Non Functional", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_non-functional` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_non-functional`)),   
               color="brown", linetype="dashed", linewidth=1)

pct_Handpump = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_Handpump`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Handpump", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_Handpump` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_Handpump`)),   
               color="brown", linetype="dashed", linewidth=1)

pct_Potable = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_Potable`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Potable", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_Potable` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_Potable`)),   
               color="brown", linetype="dashed", linewidth=1)

pct_NonPotable = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_NonPotable`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Non Potable", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_NonPotable` )),   color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_NonPotable`)),    color="brown", linetype="dashed", linewidth=1)


pct_Usage_Capacity_Over_1000 = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_Cap_Over_1000`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Usage capacity >= 1000", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_Cap_Over_1000` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_Cap_Over_1000`)),   
               color="brown", linetype="dashed", linewidth=1)


pct_Usage_Capacity_Under_1000 = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_Cap_Under_1000`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Usage capacity < 1000", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_Cap_Under_1000` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_Cap_Under_1000`)),   
               color="brown", linetype="dashed", linewidth=1)

pct_managed = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_managed`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Managed", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_managed` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_managed`)),   
               color="brown", linetype="dashed", linewidth=1)


pct_rural = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_rural`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "% Rural", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_rural` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_rural`)),   
               color="brown", linetype="dashed", linewidth=1)


pct_within_median_dist_to_pri_road = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_w_meddist_to_PriRoad`)) +
              geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
                labs(x = "% Rural", y = "Frequency") +
                geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_w_meddist_to_PriRoad` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_w_meddist_to_PriRoad`)),   
               color="brown", linetype="dashed", linewidth=1)

pct_within_median_dist_to_sec_road = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_w_meddist_to_SecRoad`)) +
              geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
                labs(x = "% Rural", y = "Frequency") +
                geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_w_meddist_to_SecRoad` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_w_meddist_to_SecRoad`)),   
               color="brown", linetype="dashed", linewidth=1)

pct_within_median_dist_to_ter_road = ggplot(data=nga_wp_interested_data_pts, 
             aes(x= `pct_w_meddist_to_TerRoad`)) +
              geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
                labs(x = "% Rural", y = "Frequency") +
                geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_w_meddist_to_TerRoad` )),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_w_meddist_to_TerRoad`)),   
               color="brown", linetype="dashed", linewidth=1)


ggarrange(functional, nonfunctional, 
          functional_pct, nonfunctional_pct,
          ncol = 2, nrow = 2)

ggarrange(pct_Potable, pct_NonPotable, 
          pct_Usage_Capacity_Over_1000, pct_Usage_Capacity_Under_1000,
          pct_managed, pct_rural,
          ncol = 2, nrow = 3)

ggarrange(pct_within_median_dist_to_pri_road, 
          pct_within_median_dist_to_sec_road, 
          pct_within_median_dist_to_ter_road, 
          pct_Handpump,
          ncol = 2, nrow = 2)

From the charts, we can observe that

  • No. of functional water points & non functional water points are positively skewed

  • percentage of functional follows closely to a normal distribution while percentage of non functional is positively skewed

  • percentage of water points with potable water is negatively skewed while percentage of water points with non potable water is positively skewed

  • percentage of water points with usage capacity over 1000 is positively skewed while percentage of water points with usage capacity less than 1000 is negatively skewed

  • percentage of managed water points is negatively skewed

  • percentage of rural water points is negatively skewed

  • percentage of water points with primary road within median distance is negatively skewed

  • percentage of water points with secondary road within median distance is negatively skewed

  • percentage of water points with tertiary road within median distance is negatively skewed

  • percentage of water points with hand pumps is negatively skewed.

As functional and non functional water points are NOT of the same units as the other variables in percentage, we will need to normalize it later.

Visualizing the spatial distribution of water points

We will plot the map with the tmap package. In this exercise we will use the jenks style as it looks for clusters of related values and highlights the differences between categories. (Nowosad, 2019)

nga_wp_interested_data_pts.map.pct_func = 
  tm_shape(nga_wp_interested_data_pts) + 
  tm_fill("pct_functional", 
          palette ="PuRd", style="jenks", n = 5) +  
  tm_borders(alpha=0.5) + 
  tm_grid (alpha=0.2) +
  tm_layout(main.title="% functional WP - 2 Layer map", 
            main.title.position="center", 
            main.title.size=0.8, 
            frame = TRUE) 
 
nga_wp_interested_data_pts.map.pct_nonfunc = 
  tm_shape(nga_wp_interested_data_pts) + 
  tm_fill("pct_non-functional", 
          palette ="PuRd", style="jenks", n = 5) +  
  tm_borders(alpha=0.5) + 
  tm_grid (alpha=0.2) +
  tm_layout(main.title="% non functional WP - 2 Layer map", 
            main.title.position="center", 
            main.title.size=0.8, 
            frame = TRUE) 

tmap_arrange(nga_wp_interested_data_pts.map.pct_func, nga_wp_interested_data_pts.map.pct_nonfunc, asp=1, ncol=2)

From the results, we can observe the following:

  • The north east area of Nigeria have little to no water point at all

  • Functional water points congregate in the northern side of the country

  • Non functional water points congregates around the south western coast of Nigera, facing gulf of Guinea & Bight of Benin.

Correlation Analysis

Finding Correlated Variables

It is important that we ensure the cluster variables are not highly correlated before we conduct cluster analysis.

We will first remove the geometry object from the data frame using st_set_geometry(NULL)

nga_wp_corr_vars = nga_wp_interested_data_pts %>%
          st_set_geometry(NULL)

We will use corrplot.mixed() (ref) function of the corrplot package. However we need to find the correlation matrix first with cor() and only use the variables we are interested in, which are in column 1 to 17.

cluster_vars.cor = cor(nga_wp_corr_vars) 

q = corrplot.mixed(cluster_vars.cor, 
               lower = "ellipse", 
               upper = "number", 
               tl.pos = "lt", 
               diag="l", 
               tl.col="black")

According to Calkins (2005), variables that can be regarded as having a high degree of correlation are indicated by correlation coefficients with magnitudes between 0.7 and 1.0. As such, the following 5 observations are highly correlated from the results

  1. pct_functional - Percentage of functional water points

  2. pct_handpump - percentage of water points with hand pumps

  3. pct_potable - Percentage of water points with potable water

  4. pct_Cap_Over_1000 - percentage of water points with capacity over 1000

  5. pct_w_meddist_to_PriRoad - percentage of water points within median distance of a primary road

In view of that we will just keep pct_functional by using the select statement

nga_wp_corr_vars = nga_wp_corr_vars %>%
                   select(1:4, 7, 9:11, 13:14)

Data Standardization

Reference

Calkins K. G (2005) Applied Statistics - Lesson 5, Correlation Coefficients

https://www.andrews.edu/~calkins/math/edrm611/edrm05.htm#:~:text=Correlation%20coefficients%20whose%20magnitude%20are,can%20be%20considered%20highly%20correlated.

Nowosad J. (2019), Map coloring: the color scale styles available in the tmap package https://geocompr.github.io/post/2019/tmap-color-scales/

ONG, J (2022), Geospatial Analytics for Social Good - Understanding Nigeria Water functional and non-functional water point rate https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#visualising-of-distribution-using-ggplot